[BugFix]add all2all when dp_size > 1 && downgrade npu_dequant_swiglu_quant #819
Conversation
Signed-off-by: angazenn <zengyanjia@huawei.com>
Force-pushed from af2215a to d6cf7a1
Force-pushed from b2ca756 to e0ab8e0
```python
dist.all_to_all_single(gather_sizes,
                       scatter_sizes,
                       group=ep_group.device_group)
scatter_size_list = scatter_sizes.cpu().tolist()
```
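For context, the `all_to_all_single` call on the size tensors exchanges per-rank split counts, so each rank learns how many tokens every peer will send it before the real token `all_to_all`. A minimal single-process simulation of that size exchange (plain Python, no process group or NPU; the function name is illustrative, not from the PR):

```python
def exchange_split_sizes(all_scatter_sizes):
    """Simulate all_to_all_single applied to per-rank split-size tensors.

    all_scatter_sizes[r][p] = number of tokens rank r sends to rank p.
    Returns gather_sizes where gather_sizes[r][p] = number of tokens
    rank r receives from rank p, i.e. the transpose of the send matrix.
    """
    world_size = len(all_scatter_sizes)
    return [[all_scatter_sizes[p][r] for p in range(world_size)]
            for r in range(world_size)]

# Two ranks: rank 0 sends 3 tokens to rank 0 and 5 to rank 1;
# rank 1 sends 2 tokens to rank 0 and 4 to rank 1.
scatter = [[3, 5],
           [2, 4]]
gather = exchange_split_sizes(scatter)
# gather[0] == [3, 2]: rank 0 receives 3 from itself and 2 from rank 1
```

The subsequent `scatter_sizes.cpu().tolist()` forces a device-to-host synchronization to materialize the split lists, which is the performance concern the reviewer raises.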
This may introduce a serious performance regression; please note that we will change this in the future.
done
```python
                      gather_dim, scatter_sizes,
                      gather_sizes)

def reduce_scatter(self,
```
Remove this if you do not use it.
done
vllm_ascend/models/deepseek_v2.py (Outdated)
```python
attn_metadata = get_forward_context().attn_metadata
if attn_metadata is None:
    # when profile runs, force experts load balance to avoid high memory
    # consumption from 1 rank.
```
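The comment refers to forcing an even token-to-expert assignment during the profile run (when `attn_metadata` is `None`), so that no single rank's experts absorb all the profiling traffic and spike memory. A hypothetical round-robin sketch of that balancing (plain Python; the function and its parameters are illustrative, not the PR's actual implementation):

```python
def force_balanced_topk_ids(num_tokens, topk, global_num_experts):
    """Assign experts round-robin so every expert receives an equal
    share of the num_tokens * topk routing slots during profiling."""
    flat = [i % global_num_experts for i in range(num_tokens * topk)]
    return [flat[t * topk:(t + 1) * topk] for t in range(num_tokens)]

ids = force_balanced_topk_ids(num_tokens=8, topk=2, global_num_experts=4)
# 16 routing slots spread over 4 experts: each expert gets exactly 4
```

With real routing, a hot expert could receive most of the profiling tokens, so the profile run would size buffers for a worst case concentrated on one rank; forcing balance keeps the measured peak realistic and bounded.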
Add more comments on this
done
Force-pushed from 3349fbd to 30eafb9
…_swiglu_quant (vllm-project#819)" This reverts commit 1e67089.
This is the best solution for A3 performance. Is there a best solution for A2 performance?
### What this PR does / why we need it?

This PR fixes two accuracy bugs introduced by PR #819 when running deepseekv3 series models:

1. #819 adds `all_to_all` communication in quantized cases, but `all_gather` && `reduce_scatter` are removed in both the quantized and unquantized cases. When running unquantized deepseekv3 models with `ep_size == world_size`, the moe modules fail to communicate. Therefore, this PR adds `all_to_all` communication in the unquantized case to solve this accuracy issue.
2. Use `ep_size` rather than `dp_size` to decide whether to use `all_to_all` in moe.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI passed with newly added and existing tests.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
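The second fix, keying the dispatch decision on `ep_size` instead of `dp_size`, can be sketched as a simple predicate (a hypothetical helper; the real routing logic in vllm-ascend is more involved):

```python
def use_all_to_all(ep_size: int) -> bool:
    """Dispatch MoE tokens with all_to_all whenever experts are sharded
    across more than one rank. dp_size alone is the wrong signal: expert
    parallelism can span all ranks (ep_size == world_size) even while
    dp_size == 1, and that case still needs the all_to_all exchange."""
    return ep_size > 1
```

For example, with `world_size == 8`, `ep_size == 8`, and `dp_size == 1`, a `dp_size > 1` check would skip the exchange and the expert shards would never see each other's tokens, which matches the accuracy failure this PR describes.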
### What this PR does / why we need it?

1. This PR adds an `all_to_all` communication operator to fix `allgather` bugs when dp_size > 1. Besides, it adds a naive implementation of force-load-balance when doing profile runs.
2. `npu_dequant_swiglu_quant` only supports input hidden_states with dtype `torch.int32`. This tensor occupies space of `global_bs * seq_len * topk * hidden_size`, which might be very large as `ep_size` grows. Therefore we need to disable this operator and use the original `swiglu` && `quantize`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By performing offline inference:
